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Mutual correlation between segments of DNA or protein sequences can be detected 
by Smith- Waterman local alignments. We present a statistical analysis of align- 
ment of such sequences, based on a recent scaling theory. A new fidelity measure 
is introduced and shown to capture the significance of the local alignment, i.e., the 
extent to which the correlated subsequences are correctly identified. It is demon- 
strated how the fidelity may be optimized in the space of penalty parameters using 
only the alignment score data of a single sequence pair. 



1 Introduction 

Sequence alignment has become an indispensable tool in molecular biologyB. 
A number of different algorithms are available to date, and their variety and 
complexity continues to growB. For a given application, however, a suitable 
type of algorithm paxid optimal scoring parameters are still chosen mostly on 
an empirical basis clou. The practical problems in the application of alignment 
algorithms reflect a number of poorly understood conceptual issues: Given 
sequences with mutual correlations, how can the fidelity of an alignment - 
i.e., the correlations correctly captured — be quantified? How can the scoring 
parameters be chosen to produce high-fidelity alignments? Are the results 
statistically and biologically significant? 

In a series of recent publications oQ&lI, we have developed a statistical scal- 
ing theory of gapped alignment aimed at addressing these issues. This theory 
describes the dependence of alignment data on the inter-sequence correlations 
and on the scoring parameters used. The entire parameter dependence of align- 
ments is contained in a number of characteristic scales. For Smith- Waterman 
alignmentsllj, the most important scales are the typical length to of mutually 
uncorrelated subsequences locally aligned, and the minimum length t c of mu- 
tually correlated subsequences detectable by alignment. Expressed in terms 
of these characteristic scales, the alignment statistics acquires universal prop- 
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erties independent of the scoring parameters. Hence, optimizing alignments 
reduces to optimizing the values of the characteristic scales. 

In this paper, we study the statistics of Smith- Waterman alignments for 
piecewise correlated sequences. We define a suitable fidelity function weighing 
appropriately aligned pairs of correlated elements against false positives. The 
parameter dependence of the fidelity is found to be captured by the scaling 
theory of alignment. High-fidelity alignments are obtained if the characteristic 
scales io and t c are of the same order of magnitude and are jointly optimized. 
For a given sequence pair, we show how this optimization can be obtained 
directly from the score data, leading to the central result of this paper: a 
simple procedure for optimizing the fidelity of Smith- Waterman alignments. 

2 The Smith- Waterman Algorithm 

We study local alignments of pairs of Markov sequences Q — {Qi} and Q' = 
{Q'j} with an approximately equal number of elements ~ N/2. Each element 
Qi or Q'j is chosen with equal probability from a set of c different letters, in- 
dependently of the other elements of the same sequence. There may, however, 
be inter-sequence correlations in pairs (Qi,Q'j)- We here take c = 4, as is 
appropriate for nucleotide sequences, although the results can be easily gen- 
eralized to arbitrary values of c. An alignment is defined as an ordered set 
of pairings (Qi, Q'j) and of gaps (Qi, — ) and (— , Q'j) involving the elements of 
two contiguous subsequences {Qi l: ■ ■ ■ , Qi 2 } and {Q'j x , ■ ■ ■ , Q'j 2 }> see Fig. 1(a). 
We define the length of an alignment as the total number of aligned elements 
of both sequences, L = i% — i\ +32 — ji- 1— 1 

A given alignment is conveniently represented as a directed path on a 
two-dimensional grid as shown in Fig. 1(b). Using the rotated coordinates 
r = i — j and t = i + j, this path is described by a single- valued function r(t) 
measuring the "displacement" of the path from the diagonal of the alignment 
grid. The length L of the alignment equals the projected length of its path 
onto the diagonal. 

Each alignment is assigned a score 5 1 , maximization of which defines the 
optimal alignment for a given scoring function. The simplest class of linear 
scoring functions is of the form S = <r+N + + <r_ N- + g N g , where N+ is the 
total number of matches (Qi — Q'j), N- the number of mismatches (Qi ^ Q'j), 
N g the number of gaps, and <r+, cr_, a g are the associated scoring parameters. 
Since an overall multiplication of the score does not change the alignment 
result, we can use the normalized scoring function 

S = <jL + yfc^lN+ - — N - 7 N g (1) 
y/c- 1 
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Figure 1: (a) One possible local alignment of two nucleotide sequences Q and Q' . 
The aligned subsequences are shown in boldface, with 4 pairings (three matches, one 
mismatch) and one gap. The alignment contains a total of L — 9 elements, (b) 
Unique representation of the alignment in (a) as directed path r(t) (the thick line) 
on a two-dimensional alignment grid. Each vertical (horizontal) bond of the path 
corresponds to a gap in sequence Q {Q'). L is the projected length onto the t axis. 



with L = 2N + + 2N_ +N g denoting again the alignment length defined above. 
This form of the scoring function contains the two natural scoring parameters: 
the score gain a per aligned element, and the gap cost 7. The parameter a 
controls the length L of the optimal alignment; while changing 7 affects its 
number of gaps, i.e., the mean square displacement of the optimal alignment 
path from the diagonal of the alignment grid. (Borrowing notions from physics 
and chemistry, we can think of the alignment path r(t) as a polymer stretched 
along the t axis, with "chemical potential" a and "line tension" 7.) 
We use the Smith- Waterman recursion relationEJ 

S(r- l,t- 1) + 0--7 

with 



and suitable boundary conditions i. S(r,t) is the score maximum for the 
set of all alignment paths ending at the point (r, t) . The optimal alignment 
ends at the point (^,£2) defined by the global score maximum, S(r2,t2) — 
max ri( S(r, t). The entire path is then traced back from the endpoint to the 
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initial point (ri,ii) given by S(r±,ti) = 0. The length of the optimal path is 
L = t% — t\. For large values of a, the optimal alignment of long sequences 
becomes a so-called global alignment involving the entire sequences Q and Q' up 
to small unpaired regions at both ends; i.e., L ~ N. In this limit, the Smith- 
Waterman_algorithm becomes equivalent to the simpler Needleman-Wunsch 
algorithm Ell. 



3 Scaling of Smith- Waterman alignments 

The statistical theory of alignment describes averages (denoted by overbars) 
over an ensemble of sequence pairs with well-defined mutual correlations. How- 
ever, we emphasize that the properties of single pairs of "typical" sequences 
are well approximated by these averages!! 

The simplest form of scaling is realized in the limit of global alignment 
(a — > oo) for pairs of Markov sequences without mutual correlations. Im- 
portant statistical averages then scale as powers of the sequence length; for 
example, the variance of the optimal score (AS 1 ) 2 oc iV 2 / 3 . The exponents of 
these power laws are universal, i.e., independent of the scoring parameters. A 
detailed discussion was given by Drasdo et al a. 

For generic values of a, the alignment statistics becomes more complicated 
even for mutually uncorrelated sequences. Most importantly, there is a phase 
transition E3 along a critical line a — o~ c (j). For a > a c , the optimal alignment 
of long sequences remains global; i.e., it has asymptotic length L ~ N and 
score S oc N for N ^> 1. This is called the linear phase. For a < o~ c , however, 
the optimal alignment ending at a given point (r, t) remains finite. The limit 
values of its average length and score, to = lim^oo L(t) and So = limt^oo S(t), 
are characteristic scales asymptotically independent of the sequence length N. 
(The argument r has been suppressed since these averages are independent of 
it.) The global optimal alignment path is then of length L ~ to log N, which 
gives the name logarithmic phase to the regime a < o~ c . 

Close to the phase transition, the characteristic scales themselves diverge 
as powers of the distance 5a = a — cr c {"f) to the critical linetl, 

t (a,j) ~ £? 3 / 2 ( 7 )|fc|- 3 / 2 , S (o-n) - B 8 / a (7)IH~ 1/a ■ ( 4 ) 

(Here ~ denotes proportionality with a (a, 7)-independent proportionality con- 
stant.) The-ippemcient function B(j) and the critical line er c (7) are known 
numerically BH. In this region, the average length and score take the scaling 
formQ __ _ 

™ m=sj±). (5) 
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The subscript of the scaling functions £ and S refers to the sign of 5a; the two 
branches correspond to the linear and the logarithmic phase, respectively. The 
entire dependence on the scoring parameters is contained in the characteristic 
scales (K), while the scaling functions S± and £± are again universal. The 
meaning of the scaling form (|J) is quite simple: It relates alignment data for 
different values of the scoring parameters. This leads to the data collapse of 
Fig. 2. 
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Fig. 2: Local alignment of sequences without mutual correlations, (a) Average score 
S{t) (over an ensemble of 1000 random sequence pairs of 10000 elements each) of the 
optimal alignment for various scoring parameters. The curves correspond to 7 = 3.0 
and Sa/<j c (-y) = 0.05 to —0.05 (top to bottom), (b) The scaled curves S(t)/So as 
functions of t/to collapse to the universal two-branched function S± of Eq. (^). The 
asymptotics ot-this function is given by power laws (dashed lines) predicted by the 
scaling theory 0. 
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We now turn to alignments of Markov sequences Q and Q' with mutually 
correlated subsequences Q and Q' (referred to below as target) of approxi- 
mately equal length N/2. The "daughter" sequence Q' is obtained from the 
"ancestor" sequence Q by a simple Markov evolution process □ with substitu- 
tion probability p and insertion/deletion probability q. The average fraction 
U = (1 — p)(l — q) of ancestor elements conserved in the daughter sequence 
quantifies the degree of correlations between Q and Q' . The remainder of Q 
and Q' has no correlations. 

A meaningful alignment of the sequences Q and Q' should (i) match a fair 
fraction / of the pairs of conserved elements (Qi,Q'j) € Q X Q' and (ii) re- 
main confined to the target region to avoid false matches. We quantify these 
properties by the fidelity junction 

2N 

which takes values between and 1. The prefactor is designed to penalize 
local alignments that are too long (L > N). Its precise form influences the 
parameter dependence of the fidelity only weakly. For global alignments, T 
reduces to the fidelity function used previously T = f. Maximizing T for 
a given pair of sequences should produce an alignment of bona fide biological 
significance. 

Alignments of correlated sequences have-, a second set of characteristic 
scales related to their statistical significance u. The threshold or correlation 
length f c (7) is the minimal length of a target to be detectable statistically by 
alignment ]]. (t c also depends on the evolution parameters, in the present case 
U and q, but is independent of a.) In the sequel, we study targets of length 
N well above t c and well below the overall length N. The relevant ensemble 
averages can then again be written in scaling form. For the fidelity and the 
length of the optimal alignment, we expect the approximate expressions 

7 Jk),l=c(±), (7) 



^*(7) VW ' N Vo 

where J 7 * (7) = max CT !F(o~, 7) denotes the relative fidelity maximum at a given 
value of 7. The important point of this scaling form is again quite simple: 
It relates alignment data at different values of the alignment parameters and 



a More precisely, we consider global alignments (cr — » 00) of sequences of length N with 
mutual correlations over their entire length (i.e., Q = Q and Q' = Q'). For N < t c , however, 
random agglomeration of matchesrtmtweigh the pairs of correlated elements, rendering the 
correlation undetectable. Sec Rcf. tj for details. 
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of the evolution parameters. The scaling functions tp and £ are universal as 
before, only their arguments t c /to depend on the parameters. This is crucial 
for finding optimal alignment parameters as we show in the next Section. 

The form of Eq. (0) has been verified numerically: Figs. 3(a) and 4(a) show 
the average fidelity and length of optimal alignments, respectively, for different 
values of 7 and a. The data for different parameter values are indeed related 
as is evident from the collapse of the scaled curves T ' jT*{j) and L/N; see 
Figs. 3(b) and 4(b). The scaled abscissa 8a/\8a*(-f)\ can be expressed in terms 
of the ratio of characteristic scales in (0), 8a/\8a*(-f)\ — (t c /ta) 2 ^ 3 , as follows 
from ([|) and the relation t c (j) ~ (8a* (7) / B(-f))~ z / 2 which is anticipated from 
a previous analysis Here, 8a* = a* (7) — 0^(7), and a* is the location of the 
relative fidelity maximum, defined from J 7 * (7) = J-(a*,j). The data collapse 
shown in Figs. 3 and 4 therefore supports the proposed scaling form (pi). 



4 Parameter dependence and optimization 

As the fidelity curves of Fig. 3(a) show, the quality of an alignment depends on 
the proper choice of both scoring parameters a and 7. The strong dependence 
of T on a can be understood by comparison with Fig. 4. The relative fidelity 
maximum T* (7) occurs at a value 8a* (7) < where the optimal alignment 
just covers the target (i.e., L — N). For 8a < 8a*, the optimal alignment 
is too short. For 8a > 8a*, the alignment "overshoots" the target, adding 
random matches to both sides and reducing its fidelity. As 8a f 0, the length 
L increases continuously to values of order N; that is, the optimal alignment 
becomes global. Our result that L w N when 8a = 8a* (Fig. 4) justifies the 
use of Eq. (|J) as a fidelity measure for local alignment. 

For real alignment applications with unknown sequence correlations, the 
fidelity is of course not accessible directly. What is readily accessible is the 
optimal score S of an alignment. Below, we describe how the fidelity maximum 
can be inferred from the score data. The key quantity to consider is the 
parameter dependence of the score ratio 

s(a, 1 ) = S/S . (8) 

As shown in Fig. 5 for alignment of a single pair of sequences, s attains its rel- 
ative maximum for fixed 7 at a value 5cr max (7) close to 8a* (7) and its absolute 
maximum at a value 7 max w 7*. More importantly, a comparison of Fig. 5(b) 
with Fig. 3(b) shows that the fidelity ^(a™ 1 ^, 7 max ) evaluated at the maximum 
of s is very close to the actual fidelity maximum T* . While the fidelity and 
score patterns fluctuate for individual sequence pairs, this relationship between 
their maxima turns out to be remarkably robust. Our results therefore suggest 
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Fig. 3: Fidelity of local alignments for piecewise correlated sequences, (a) The av- 
erage fidelity T of the optimal alignment for various values of the scoring scoring 
parameters, each averaged over an ensemble of 100 — 800 sequences pairs. The se- 
quences are of length N/2 = 10000; they contain mutually correlatecLsubsequences 
of length N /2 = 2000, which are related by Markovian evolution rules u with param- 
eters U = 0.3 and q = 0.25. (b) The scaled curves T / ' T*{p/) as functions of the scaled 
abscissa x = Sa/\Scr* collapse to the single scaling function tp(x 3 ^ 2 ) in accordance 
with Eqs. @ and (Q). 
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Fig. 4: Length of local alignments for piecewise correlated sequences, (a) The average 
length L of the optimal alignment for various scoring parameters, obtained from the 
sequence pairs of Fig. 3. (b) The scaled curves L/N as functions of the scaled abscissa 
x — 5a /\5<t* {^f)\ collapse to the single scaling function C{x 3 ^ 2 ) in accordance with 
Eqs. (^) and (Q). Note that at the point of maximal fidelity (x = — 1), the alignment 
length equals the target length, i.e., L/N ~ 1. 
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Fig. 5: Score of local alignments for piecewise correlated sequences, (a) The score 
ratio s — S/So of the optimal alignment for various scoring parameters, obtained 
from a single pair of the correlated sequences described in Fig. 3. (b) The scaled 
curves of s as functions of x = 8a/\8a*("f)\ have maxima in the high-fidelity region 
around the point 5er/\6cr*(j) \ = — 1; cf. Fig. 3(b). 
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that high-fidelity alignments can be obtained by maximizing the score ratio s. 
As can be seen from Fig. 5, the parameter dependence of s(er, 7) is given by a 
"mountain" with a well-defined local maximum. The location of the maximum 
(cr max , 7 max ), anc j h ence the location of the fidelity maximum, is accessible by 
a standard iterative procedure in a few steps. 

This link between fidelity and score data is expected from the scaling 
theory of alignment. For 7 « 7*, the score ratio takes the scaling form s — 
(N/t c )S(t c /t ) similar to (g). The relative maxima J r *( 7 ) and s max (7) = 
max CT s(ct, 7) are determined by the maxima of the scaling functions (p and <S, 
respectively. These are functions of the same variable r c = t c /to; their maxima 
are found to occur at values r* T ™ ax both of order 1. The lines 5<j*(j) and 
£cr max (7) are then given by the equations t c (ct, 7) = r* and t c (ct, 7) = r™ ax , 
respectively- The positions of the absolute maxima turn out to be_celated in a 
similar waya. A more detailed discussion will be given elsewhereEj. 

The optimization criterion can be reformulated in two ways: 

(i) The relative maxima of the score ratio define the function s max (7) = 
T maK N/t c ('y). Hence, the global maximum s max is obtained by minimizing 
the threshold length t c while keeping r t™, i.e., to of order t c . 

(ii) The threshold length t c is related to another important quantity, the score 
gain SE over uncorrelated sequences per aligned element in global alignments 
(see the detailed discussion in Drasdo et al.a). We have t c ~ _B 3 / 2 (7)(<5_E)~ 3 / 2 . 
By comparison with (||), it follows that the above optimization is equivalent 
to maximizing 5E while keeping \8a\ of order SE. 

5 Summary 

We have presented a conceptually simple and statistically well-founded proce- 
dure to optimize Smith- Waterman alignments. For given scoring parameters 
in the logarithmic phase, we compute (i) the score S of the optimal alignment 
and (ii) the average background score Sq obtained by randomizing the se- 
quences. The scoring parameters are then improved iteratively by maximizing 
the score ratio S/Sq. We have shown that this procedure produces alignments 
of high fidelity on test sequences. Efficient algorithmic implementations and 
applications to real biological sequences are currently being studied 113. 
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